Efficient and Flexible Index Access in MapReduce
Abstract
A popular programming paradigm in the cloud, MapReduce is widely used for “big data” analysis. Unfortunately, many “big data” applications require capabilities beyond those MapReduce was originally designed for, forcing developers to write unnatural, non-obvious MapReduce programs that twist the underlying system to meet their requirements. In this paper, we focus on a class of “big data” applications that, in addition to MapReduce’s main data source, require selective access to one or more other data sources, e.g., various kinds of indices, knowledge bases, and external cloud services. We propose to extend MapReduce with EFind, an Efficient and Flexible index access solution, to better support this class of applications. EFind introduces a standard index access interface to MapReduce so that (i) developers can easily and flexibly express index access operations without unnatural code, and (ii) the EFind-enhanced MapReduce system can automatically optimize the index access operations. We propose and analyze a number of index access strategies that use caching, re-partitioning, and index locality to reduce redundant index accesses. EFind collects index statistics and performs cost-based adaptive optimization to improve index access efficiency. Our experimental results, using both real-world and synthetic data sets, show that EFind chooses execution plans that are optimal or close to optimal, and achieves a 2x–8x improvement over an approach that accesses indices without optimization.
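To make the idea of redundant index accesses concrete, the following is a minimal single-process sketch in Python. It is not EFind's interface or implementation: the `CountingIndex` class, the three `map_*` functions, and the sample data are all hypothetical names invented for illustration. It only shows why caching and re-partitioning, two of the strategies named in the abstract, can reduce the number of lookups when the same key appears in many input records.

```python
from collections import defaultdict


class CountingIndex:
    """Hypothetical stand-in for an external index; counts how often it is queried."""

    def __init__(self, data):
        self.data = data
        self.lookups = 0

    def lookup(self, key):
        self.lookups += 1
        return self.data.get(key)


def map_naive(records, index):
    # Baseline: one index access per input record, even for repeated keys.
    return [(key, value, index.lookup(key)) for key, value in records]


def map_cached(records, index):
    # Caching: remember results already fetched within this (simulated) map task.
    cache = {}
    out = []
    for key, value in records:
        if key not in cache:
            cache[key] = index.lookup(key)
        out.append((key, value, cache[key]))
    return out


def map_repartitioned(records, index):
    # Re-partitioning: group records by lookup key first,
    # then access the index once per group.
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    out = []
    for key, values in groups.items():
        hit = index.lookup(key)
        out.extend((key, value, hit) for value in values)
    return out


# Illustrative data (assumed): four records, two distinct lookup keys.
records = [("ip1", "a"), ("ip2", "b"), ("ip1", "c"), ("ip1", "d")]
geo = {"ip1": "US", "ip2": "DE"}

naive_index = CountingIndex(geo)
cached_index = CountingIndex(geo)
regrouped_index = CountingIndex(geo)

naive_out = map_naive(records, naive_index)
cached_out = map_cached(records, cached_index)
regrouped_out = map_repartitioned(records, regrouped_index)

print(naive_index.lookups, cached_index.lookups, regrouped_index.lookups)
```

In this toy setting the two optimized variants issue the same number of lookups. On a real cluster they would differ: a cache only helps within the task that holds it, whereas re-partitioning brings all records with the same key together across tasks at the cost of an extra shuffle. Which trade-off wins depends on the data, which is presumably why the abstract describes a cost-based choice between strategies.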
Similar resources
FPMR: MapReduce Framework on FPGA: A Case Study of RankBoost Acceleration
Machine learning and data mining are gaining increasing attention in the computing community. FPGAs provide a highly parallel, low-power, and flexible hardware platform for this domain, but the difficulty of programming them greatly limits their prevalence. MapReduce is a parallel programming framework that can easily exploit the inherent parallelism in algorithms. In this paper, we describe FPMR,...
ScalaGiST: Scalable Generalized Search Trees for MapReduce Systems [Innovative Systems Paper]
MapReduce has become the state-of-the-art for data parallel processing. Nevertheless, Hadoop, an open-source equivalent of MapReduce, has been noted to have sub-optimal performance in the database context since it is initially designed to operate on raw data without utilizing any type of indexes. To alleviate the problem, we present ScalaGiST – scalable generalized search tree that can be seaml...
Modular Data Clustering - Algorithm Design beyond MapReduce
In the context of Big Data, flexible and adjustable data analytics are becoming more and more important, while efficient, scalable, and fault-tolerant execution is required as well. To meet both the flexibility and the execution requirements, the analysis methods have to be specified in an appropriate and easily adjustable manner. The MapReduce approach has demonstrated that such fl...
Comparing Distributed Indexing: To MapReduce or Not?
Information Retrieval (IR) systems require input corpora to be indexed. The advent of terabyte-scale Web corpora has reinvigorated the need for efficient indexing. In this work, we investigate distributed indexing paradigms, in particular within the auspices of the MapReduce programming framework. Specifically, we describe two indexing approaches based on the original MapReduce paper, and comp...
Sorting, Searching, and Simulation in the MapReduce Framework
In this paper, we study the MapReduce framework from an algorithmic standpoint and demonstrate the usefulness of our approach by designing and analyzing efficient MapReduce algorithms for fundamental sorting, searching, and simulation problems. This study is motivated by a goal of ultimately putting the MapReduce framework on an equal theoretical footing with the well-known PRAM and BSP paralle...
Publication date: 2014